This project explores housing data from Kaggle: Housing Price Competition.

Load the Data

The housing data as downloaded from Kaggle has quite a few missing values. For this project, all columns with missing values have been dropped from the data set. Also, there were 62 variables in the original dataset. The final dataset used for the project has 1456 observations and 30 variables.

## 'data.frame':    1456 obs. of  30 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Here is a brief description of the variables that have been explored in the project:

Univariate Plots

Explore general features

Sale price distribution

Since the distribution has a slight positive skew, it is log transformed to bring it closer to a normal distribution.
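The effect of such a transformation can be checked numerically. The sketch below uses simulated right-skewed prices rather than the Kaggle file, and `skewness` is a hypothetical helper defined here for illustration; it is not part of base R or of the original analysis.

```r
# Simulated right-skewed prices (illustrative only, not the Kaggle data)
set.seed(42)
sale_price <- round(exp(rnorm(1000, mean = 12, sd = 0.4)))

# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(sale_price)
skew_log <- skewness(log10(sale_price))

# The log-transformed values should be much closer to symmetric
cat("skewness, raw prices:  ", round(skew_raw, 2), "\n")
cat("skewness, log10 prices:", round(skew_log, 2), "\n")
```

A skewness near zero after the transformation suggests the log scale is a reasonable choice for plotting and modelling.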

Sale price under different sale conditions

These are various conditions of sale as per documentation:

  • Normal: Normal Sale
  • Abnorml: Abnormal Sale - trade, foreclosure, short sale
  • AdjLand: Adjoining Land Purchase
  • Alloca: Allocation - two linked properties with separate deeds, typically condo with a garage unit
  • Family: Sale between family members
  • Partial: Home was not completed when last assessed (associated with New Homes)

While most of the sales were under normal conditions, sales of new houses which were partially complete fetched higher prices.

Sale price under different sale type

These are various types of sale as per documentation:

  • WD Warranty Deed - Conventional
  • CWD Warranty Deed - Cash
  • VWD Warranty Deed - VA Loan
  • New Home just constructed and sold
  • COD Court Officer Deed/Estate
  • Con Contract 15% Down payment regular terms
  • ConLw Contract Low Down payment and low interest
  • ConLI Contract Low Interest
  • ConLD Contract Low Down
  • Oth Other

Most of the sales were conventional warranty deeds or homes that were just constructed and sold, with new homes sold at higher prices.

Distribution of sale price for different type of dwellings

The type of dwelling is defined in the data by two variables:

BldgType, which is a general description of a building,

  • 1Fam: Single-family Detached
  • 2FmCon: Two-family Conversion; originally built as one-family dwelling
  • Duplx: Duplex
  • TwnhsE: Townhouse End Unit
  • TwnhsI: Townhouse Inside Unit

and MSSubClass, which identifies the type of dwelling involved in the sale in more detail:

  • 20 1-STORY 1946 & NEWER ALL STYLES
  • 30 1-STORY 1945 & OLDER
  • 40 1-STORY W/FINISHED ATTIC ALL AGES
  • 45 1-1/2 STORY - UNFINISHED ALL AGES
  • 50 1-1/2 STORY FINISHED ALL AGES
  • 60 2-STORY 1946 & NEWER
  • 70 2-STORY 1945 & OLDER
  • 75 2-1/2 STORY ALL AGES
  • 80 SPLIT OR MULTI-LEVEL
  • 85 SPLIT FOYER
  • 90 DUPLEX - ALL STYLES AND AGES
  • 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
  • 150 1-1/2 STORY PUD - ALL AGES
  • 160 2-STORY PUD - 1946 & NEWER
  • 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
  • 190 2 FAMILY CONVERSION - ALL STYLES AND AGES

Single-family detached buildings are the most common type, and most dwellings are one- or two-storey houses built in 1946 or later.

Density plot of sale price for all five building types

Houses with seemingly more privacy, like single-family detached houses and townhouse end units, fetched higher prices than the others. Also, although duplex houses in general weren’t as expensive, newer two-storeyed houses sold at higher prices.

Sales as per the general zoning classification

MSZoning: Identifies the general zoning classification of the sale.

  • A Agriculture
  • C Commercial
  • FV Floating Village Residential
  • I Industrial
  • RH Residential High Density
  • RL Residential Low Density
  • RP Residential Low Density Park
  • RM Residential Medium Density

Most of the houses are in residential low and medium density zones, and houses in residential low density zones were pricier.

Sale price distribution for overall quality of the dwelling

OverallQual: Rates the overall material and finish of the house

  • 10 Very Excellent
  • 9 Excellent
  • 8 Very Good
  • 7 Good
  • 6 Above Average
  • 5 Average
  • 4 Below Average
  • 3 Fair
  • 2 Poor
  • 1 Very Poor

Most of the houses in the dataset were of average or better quality at the time of sale. As expected, better quality houses have higher sale prices.

Distribution of building age

The age of a building as perceived at the time of sale is taken as the difference between the year of sale and the year of remodelling; the remodel year is the same as the construction year if there was no remodelling or addition.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       4      14      23      41      60

Most of the properties in the dataset are 20 years old or less. It will be interesting to explore the sale price of a building with respect to its age.
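The age calculation described above can be sketched as follows. The five rows are the first values of YrSold and YearRemodAdd shown in the `str()` output; the column name `BldgAge` follows the name used later in the correlation discussion, though the exact code of the original analysis is not shown and this is only one plausible way to write it.

```r
# First five rows of YrSold and YearRemodAdd, as listed in the str() output
df <- data.frame(
  YrSold       = c(2008, 2007, 2008, 2006, 2008),
  YearRemodAdd = c(2003, 1976, 2002, 1970, 2000)
)

# Age at sale, measured from the last remodel (or construction) year
df$BldgAge <- df$YrSold - df$YearRemodAdd

print(df)
print(summary(df$BldgAge))
```

For these five houses the ages come out as 5, 31, 6, 36 and 8 years.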

Total sales by month

Although there is a variation in total sales for each month of a year, distribution of sale price doesn’t vary much across the year.

Explore property specific features

Preferred neighbourhood

Let us see if there is any preference for some specific neighbourhoods.

Neighborhood: Physical locations within Ames city limits

  • Blmngtn Bloomington Heights
  • Blueste Bluestem
  • BrDale Briardale
  • BrkSide Brookside
  • ClearCr Clear Creek
  • CollgCr College Creek
  • Crawfor Crawford
  • Edwards Edwards
  • Gilbert Gilbert
  • IDOTRR Iowa DOT and Rail Road
  • MeadowV Meadow Village
  • Mitchel Mitchell
  • NAmes North Ames
  • NoRidge Northridge
  • NPkVill Northpark Villa
  • NridgHt Northridge Heights
  • NWAmes Northwest Ames
  • OldTown Old Town
  • SWISU South & West of Iowa State University
  • Sawyer Sawyer
  • SawyerW Sawyer West
  • Somerst Somerset
  • StoneBr Stone Brook
  • Timber Timberland
  • Veenker Veenker

There does seem to be a distinct price preference for some neighbourhoods, like Northridge, Northridge Heights and Stone Brook. This preference will be explored later in the bivariate and multivariate plots sections.

Distribution of lot area

Since the distribution of lot area is highly skewed, LotArea is log transformed to normalise the distribution.

General shape of property, contour and slope

LotShape: General shape of property

  • Reg Regular
  • IR1 Slightly irregular
  • IR2 Moderately Irregular
  • IR3 Irregular

LandContour: Flatness of the property

  • Lvl Near Flat/Level
  • Bnk Banked - Quick and significant rise from street grade to building
  • HLS Hillside - Significant slope from side to side
  • Low Depression

LandSlope: Slope of property

  • Gtl Gentle slope
  • Mod Moderate Slope
  • Sev Severe Slope

Most of the properties are regular shaped or slightly irregular, with gentle slope and level contour. Interestingly, moderately irregular plots sold at higher prices.

Type of utilities available

Utilities: Type of utilities available

  • AllPub All public Utilities (E,G,W,& S)
  • NoSewr Electricity, Gas, and Water (Septic Tank)
  • NoSeWa Electricity and Gas Only
  • ELO Electricity only

Most of the properties have all the utilities available.

Distribution of Sale Price as per lot configuration

LotConfig: Lot configuration as defined in the dataset

  • Inside: Inside lot
  • Corner: Corner lot
  • CulDSac: Cul-de-sac
  • FR2: Frontage on 2 sides of property
  • FR3: Frontage on 3 sides of property

Most of the lots were either inside or corner lots. Also, houses in cul-de-sacs and the houses with three side frontage are pricier.

Does access street have any bearing on the house price?

Street: Type of road access to property

  • Grvl: Gravel
  • Pave: Paved

Most of the houses have paved street access and, as expected, houses with paved access fetch higher prices than those with gravel access.

Explore structure specific features

Above grade living area distribution

Ground living area distribution is normalised by log transformation before plotting.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1128    1458    1507    1775    3627

Garage area distribution

Garage area distribution is normalised by log transformation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   329.5   478.5   471.6   576.0  1390.0

Basement area distribution

Basement area distribution is normalised by log transformation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.0   990.5  1050.7  1293.8  3206.0

Sale price distribution by number of rooms in the house

More rooms could mean bigger houses and hence higher prices.

Distribution of Sale Price and Ground Living Area For Different Roofing Style

There are different types of roof defined for houses:

  • Flat
  • Gable
  • Gambrel: Gambrel (Barn)
  • Hip
  • Mansard
  • Shed

Most of the houses had either gabled or hipped roofs, with some of these houses having higher ground living area. Looking at the sale price distribution, some of the houses with these roofs did fetch higher prices too. A combined plot of sale price with ground living area can reveal whether higher area corresponded to higher price, which will be explored in the bivariate plots section.

Sale price as per kitchen quality

KitchenQual: Kitchen quality

  • Ex Excellent
  • Gd Good
  • TA Typical/Average
  • Fa Fair
  • Po Poor

While most of the houses had an average or good kitchen, a kitchen in excellent condition did influence the sale price of the house.

Sale price and bathrooms

While the bathroom to bedroom ratio does seem to influence the sale price to some extent, too many bathrooms per bedroom don’t seem to matter much.
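A bathroom to bedroom ratio could be derived along the following lines. This is only a sketch: the original definition is not given in the text, so counting a half bath as 0.5 and the column names `TotalBath` and `BathPerBed` are assumptions made here for illustration. The input values are the first five rows of FullBath, HalfBath and BedroomAbvGr from the `str()` output.

```r
# First five rows of the bathroom and bedroom counts from the str() output
df <- data.frame(
  FullBath     = c(2, 2, 2, 1, 2),
  HalfBath     = c(1, 0, 1, 0, 1),
  BedroomAbvGr = c(3, 3, 3, 3, 4)
)

# Assumption: a half bath counts as 0.5 of a bathroom
df$TotalBath <- df$FullBath + 0.5 * df$HalfBath

# Houses with no above-grade bedrooms get NA rather than Inf or NaN
df$BathPerBed <- ifelse(df$BedroomAbvGr > 0,
                        df$TotalBath / df$BedroomAbvGr,
                        NA_real_)

print(round(df$BathPerBed, 3))
```

Guarding against zero bedrooms matters because a division by zero would otherwise distort any summary or correlation computed on the ratio.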

Univariate Analysis

The tidy housing dataset used for the project has 1456 observations and 31 variables.

The main features of interest in the dataset are:

Some of the features which might help with investigation of the main features are:

Apart from these, others like number of bathrooms, lot shape and configuration also seem to have some influence on the sale price. Also, it would be important to explore whether any of these features are correlated.

The following new variables were created from existing variables in the dataset:

The following features had skewed distributions, which were normalised using log transformation:

Reducing skew is useful before applying many statistical methods, since many parametric tests assume the data to be approximately normally distributed.

Bivariate Plots Section

Correlation between numeric features

While, as expected, OverallQual is very highly correlated with SalePrice, some features like GrLivArea, GarageArea and TotalBsmtSF also correlate well with the target variable. Others like YearRemodAdd and YearBuilt can also be good predictors of the SalePrice of the property. There are also some features which are correlated with each other, such as YearRemodAdd with BldgAge and GrLivArea with TotRmsAbvGrd.

Explore sale price with respect to the lot area

There is a correlation of 0.4 between the sale price of a house and its lot area.

Explore sale price with respect to the ground living area

There is a strong correlation of 0.73 between the sale price of a house and its living area.

Explore sale price with respect to the garage area

There is a correlation of 0.46 between the sale price of a house and its garage area.

Explore sale price with respect to the basement area

There is a correlation of 0.37 between the sale price of a house and its basement area.
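Correlations like the ones quoted for the area features can be computed with `cor()`. The sketch below uses only the first ten rows shown in the `str()` output, so it will not reproduce the figures reported for the full dataset. It computes Pearson correlations on the untransformed values; whether the original analysis used raw or log-transformed values is not stated.

```r
# First ten rows of the relevant columns, copied from the str() output
df <- data.frame(
  SalePrice   = c(208500, 181500, 223500, 140000, 250000,
                  143000, 307000, 200000, 129900, 118000),
  LotArea     = c(8450, 9600, 11250, 9550, 14260,
                  14115, 10084, 10382, 6120, 7420),
  GrLivArea   = c(1710, 1262, 1786, 1717, 2198,
                  1362, 1694, 2090, 1774, 1077),
  GarageArea  = c(548, 460, 608, 642, 836, 480, 636, 484, 468, 205),
  TotalBsmtSF = c(856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991)
)

# Pearson correlation of each area feature with the sale price
predictors <- c("LotArea", "GrLivArea", "GarageArea", "TotalBsmtSF")
cors <- sapply(df[predictors], function(x) cor(x, df$SalePrice))

print(round(cors, 2))
```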

Overall quality with respect to year of built

There does seem to be a positive correlation between the overall quality of a property and the year it was built, with more recent properties having better overall quality than older structures. There were a few older houses which were in good condition.

Overall quality and neighbourhood

A higher percentage of houses in neighbourhoods like Northridge, Northridge Heights, Somerset, Stone Brook and Bloomington Heights were of better quality. Interestingly though, as noticed in the univariate analysis, Northridge, Northridge Heights and Stone Brook were pricier than others, in spite of good quality houses in other neighbourhoods.

What about the distribution of new and old buildings in different neighbourhoods?

Let us consider all buildings older than 20 years as ‘Old’.

Northridge, Northridge Heights, Somerset and Bloomington Heights had all new houses, while more than 75% houses in College Creek, Gilbert and Stone Brook were new.
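The share of new and old buildings per neighbourhood can be tabulated as sketched below. The neighbourhood codes are real, but the ages are made up for illustration, and the column name `AgeGroup` is a hypothetical choice rather than the name used in the original analysis.

```r
# Hypothetical mini data set: neighbourhood codes are real, ages are made up
df <- data.frame(
  Neighborhood = c("NoRidge", "NoRidge",
                   "OldTown", "OldTown", "OldTown",
                   "CollgCr", "CollgCr", "CollgCr", "CollgCr"),
  BldgAge      = c(3, 12, 45, 58, 10, 2, 7, 25, 15)
)

# Buildings older than 20 years are 'Old', the rest are 'New'
df$AgeGroup <- ifelse(df$BldgAge > 20, "Old", "New")

# Row-wise proportions: share of New and Old within each neighbourhood
age_share <- prop.table(table(df$Neighborhood, df$AgeGroup), margin = 1)

print(round(age_share, 2))
```

Using `margin = 1` makes each row sum to one, so the table reads directly as the percentage of new and old houses in each neighbourhood.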

Examine neighbourhood affluence and age of the building

It was observed earlier that some of the neighbourhoods had higher priced houses than others. Maybe some of the neighbourhoods have old houses and people don’t prefer them? Let us divide the neighbourhoods into three groups: affluent (Level 1), with average sale price above 250,000; not so affluent (Level 2), with average sale price between 150,000 and 250,000; and not affluent (Level 3), with average sale price below 150,000. We then look at the age of the building, at the time of the sale, in each group.

Neighborhood  Affluence
Blmngtn       Level 2
Blueste       Level 3
BrDale        Level 3
BrkSide       Level 3
ClearCr       Level 2
CollgCr       Level 2
Crawfor       Level 2
Edwards       Level 3
Gilbert       Level 2
IDOTRR        Level 3
MeadowV       Level 3
Mitchel       Level 2
NAmes         Level 3
NoRidge       Level 1
NPkVill       Level 3
NridgHt       Level 1
NWAmes        Level 2
OldTown       Level 3
Sawyer        Level 3
SawyerW       Level 2
Somerst       Level 2
StoneBr       Level 1
SWISU         Level 3
Timber        Level 2
Veenker       Level 2
## df$affluence: Level 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  154000  254000  302000  314627  367294  625000 
## -------------------------------------------------------- 
## df$affluence: Level 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   76000  165000  190000  200649  230000  424870 
## -------------------------------------------------------- 
## df$affluence: Level 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  110000  130000  132525  149000  475000
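The grouping into affluence levels can be sketched as follows. The neighbourhood codes are real but the prices are made up, and the use of `tapply()` with `cut()` is an assumption about one reasonable way to do it, not the code of the original analysis. How the original treated a mean exactly on a boundary is not stated; here boundaries fall into the lower level.

```r
# Hypothetical mini data set: neighbourhood codes are real, prices are made up
df <- data.frame(
  Neighborhood = c("NoRidge", "NoRidge", "CollgCr", "CollgCr",
                   "OldTown", "OldTown"),
  SalePrice    = c(320000, 290000, 210000, 185000, 120000, 135000)
)

# Mean sale price per neighbourhood
mean_price <- tapply(df$SalePrice, df$Neighborhood, mean)

# Level 1: above 250,000; Level 2: 150,000 to 250,000; Level 3: below 150,000
level <- as.character(cut(as.vector(mean_price),
                          breaks = c(-Inf, 150000, 250000, Inf),
                          labels = c("Level 3", "Level 2", "Level 1")))
names(level) <- names(mean_price)

# Attach the neighbourhood-level label to every sale
df$affluence <- unname(level[as.character(df$Neighborhood)])

# Sale price summary within each affluence level
print(by(df$SalePrice, df$affluence, summary))
```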

Most of the older houses were sold below US$200K, with a marked dip in sale price for houses older than 20 years. As expected, more affluent neighbourhoods have newer houses than the others, although some older houses also seemed to fetch good prices. The correlation between sale price and the age of the building is -0.52.

We can look at the sale price distribution of the new and old buildings separately, with buildings above 20 years of age marked as “Old”.

Correlation of sale price with building age for new buildings is -0.18, while for older buildings it is -0.4.

Let us check if proximity to certain amenities affects neighbourhood affluence.

Condition1: Proximity to various conditions

  • Artery Adjacent to arterial street
  • Feedr Adjacent to feeder street
  • Norm Normal
  • RRNn Within 200’ of North-South Railroad
  • RRAn Adjacent to North-South Railroad
  • PosN Near positive off-site feature–park, greenbelt, etc.
  • PosA Adjacent to positive off-site feature
  • RRNe Within 200’ of East-West Railroad
  • RRAe Adjacent to East-West Railroad

Here observations are grouped by neighbourhood and the mean sale price for each proximity condition is plotted.

## df$Condition1: Artery
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   66500  105000  119550  135092  143000  475000 
## -------------------------------------------------------- 
## df$Condition1: Feedr
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40000  120825  139500  142256  167750  244600 
## -------------------------------------------------------- 
## df$Condition1: Norm
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  131500  165800  183596  219428  625000 
## -------------------------------------------------------- 
## df$Condition1: PosA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  180000  188750  212500  225875  244000  335000 
## -------------------------------------------------------- 
## df$Condition1: PosN
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  109500  166125  206000  216875  257375  385000 
## -------------------------------------------------------- 
## df$Condition1: RRAe
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   87000  127750  142500  138400  156500  171000 
## -------------------------------------------------------- 
## df$Condition1: RRAn
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   79500  152394  171495  184397  190105  423000 
## -------------------------------------------------------- 
## df$Condition1: RRNe
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  187000  188875  190750  190750  192625  194500 
## -------------------------------------------------------- 
## df$Condition1: RRNn
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  110000  128000  214000  212400  290000  320000

While most of the affluent neighbourhoods had normal proximity to various conditions, and people might be willing to pay for the houses there for various other factors, proximity to positive off-site features like parks and greenbelts did matter to a certain extent.

Did people pay a premium for the neighbourhood?

Indeed, for same size of houses, sale price was higher for more affluent neighbourhood.

Age of the houses and size & rooms

##      
##         2   3   4   5   6   7   8   9  10  11  12  14
##   New   0   9  37 118 199 213 125  58  28  13   4   1
##   Old   1   8  60 157 203 116  62  17  17   4   6   0

Many of the bigger houses of all ages did have more rooms, with a correlation of 0.54.

Overall quality of bigger homes

It looks like bigger houses were of better overall quality too.

Bigger homes have more bathrooms?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5000  0.6250  0.6409  0.8333  2.0000

Correlation between the total number of bathrooms and the living area of a house is 0.71, which is quite high.

Older homes have more bathrooms?

Correlation between the total number of bathrooms and the age of a house is -0.45, which is moderate and negative.

Bivariate Analysis

Two main features of the dataset, SalePrice and OverallQual, are highly correlated with each other (correlation = 0.8). Some of the features have high to moderate correlation with SalePrice, like

As expected, there was a positive correlation between the overall quality of a property and the year it was built, with more recent properties having better overall quality than older structures. Not surprisingly, newer houses were more expensive.

While most of the houses had an average or good kitchen, a kitchen in excellent condition did influence the sale price of the house. It would be worthwhile to investigate this in terms of older houses fetching good prices.

Quite a few expensive properties were near positive off-site features such as parks and greenbelts. People paid a high premium for such features.

Although old houses were mostly cheaper, some of them did fetch good prices. Some of these houses were in the premium neighbourhoods. Also, older houses had fewer bathrooms and, as expected, bigger houses had more bathrooms.

It was very interesting to observe that bigger houses were also in better condition, hence fetched better price too.

There was a preference for some neighbourhoods, like Northridge, Northridge Heights and Stone Brook. These neighbourhoods had a higher percentage of better quality houses and were expensive. Importantly, while these neighbourhoods with all new houses in good condition had pricier houses, a few other neighbourhoods with new houses in good condition didn’t get good prices. It looks like these are high end neighbourhoods!

There was a very strong relation between SalePrice and overall quality of the house, which seems quite logical. Also, houses with higher ground living area, meaning bigger houses, had higher sale price. Other features like age of the building and quality of kitchen also had good bearing on the sale price of the house.

Multivariate Plots Section

Sale price, living area and garage area

Sale price, living area and basement area

Sale price, living area and overall quality

Sale price, living area and kitchen quality

Kitchen quality is converted to numeric here as follows:

  • Ex = 3
  • Gd = 2
  • TA = 1
  • Fa = 0
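One way to apply this mapping is a named lookup vector, as sketched below. The variable names `qual_score` and `kitchen_num` are hypothetical, and the input is a short illustrative vector of codes rather than the full column. The level Po is not given a value in the mapping, so it would come out as `NA` in this sketch.

```r
# Illustrative vector of kitchen quality codes
kitchen_qual <- c("Gd", "TA", "Gd", "Gd", "Ex", "Fa")

# Mapping stated in the text; "Po" has no stated value and maps to NA here
qual_score <- c(Ex = 3, Gd = 2, TA = 1, Fa = 0)

# Look up each code by name and drop the names from the result
kitchen_num <- unname(qual_score[kitchen_qual])

print(kitchen_num)
```

A numeric score lets kitchen quality be mapped to a colour gradient in a scatter plot and enter a linear model as a single coefficient, at the cost of assuming equal spacing between the levels.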

Sale price, living area and neighbourhood

Sale price, living area and building age

A building built or remodelled more than 20 years before its year of sale is taken as ‘Old’.

Sale price, year built and living area

Some of the old houses which sold at higher prices did have bigger living areas.

Sale price, year built and kitchen quality

Some of the old houses which sold at higher prices did have better quality kitchens.

Sale price, year built and garage area

Old houses which sold at higher prices had bigger garages.

Sale price, year built and basement area

Linear regression model

Linear regression model uses log transformed values of SalePrice and GrLivArea.

## 
## Calls:
## m1: lm(formula = I(log10(SalePrice)) ~ OverallQual, data = df)
## m2: lm(formula = I(log10(SalePrice)) ~ OverallQual + I(log10(GrLivArea)), 
##     data = df)
## m3: lm(formula = I(log10(SalePrice)) ~ OverallQual + I(log10(GrLivArea)) + 
##     YearBuilt, data = df)
## m4: lm(formula = I(log10(SalePrice)) ~ OverallQual + I(log10(GrLivArea)) + 
##     YearBuilt + Neighborhood, data = df)
## m5: lm(formula = I(log10(SalePrice)) ~ OverallQual + I(log10(GrLivArea)) + 
##     YearBuilt + Neighborhood + KitchenQual, data = df)
## 
## =======================================================================================================
##                                       m1            m2            m3            m4            m5       
## -------------------------------------------------------------------------------------------------------
##   (Intercept)                        4.595***      3.340***      0.478**       0.932***      1.332***  
##                                     (0.012)       (0.055)       (0.172)       (0.276)       (0.271)    
##   OverallQual                        0.103***      0.074***      0.053***      0.046***      0.040***  
##                                     (0.002)       (0.002)       (0.002)       (0.002)       (0.002)    
##   I(log10(GrLivArea))                              0.453***      0.508***      0.468***      0.455***  
##                                                   (0.019)       (0.018)       (0.017)       (0.017)    
##   YearBuilt                                                      0.001***      0.001***      0.001***  
##                                                                 (0.000)       (0.000)       (0.000)    
##   Neighborhood: Blueste/Blmngtn                                               -0.057        -0.032     
##                                                                               (0.052)       (0.050)    
##   Neighborhood: BrDale/Blmngtn                                                -0.112***     -0.094***  
##                                                                               (0.025)       (0.024)    
##   Neighborhood: BrkSide/Blmngtn                                                0.025         0.034     
##                                                                               (0.021)       (0.021)    
##   Neighborhood: ClearCr/Blmngtn                                                0.099***      0.102***  
##                                                                               (0.022)       (0.021)    
##   Neighborhood: CollgCr/Blmngtn                                                0.032         0.035*    
##                                                                               (0.018)       (0.017)    
##   Neighborhood: Crawfor/Blmngtn                                                0.101***      0.105***  
##                                                                               (0.021)       (0.020)    
##   Neighborhood: Edwards/Blmngtn                                               -0.004         0.004     
##                                                                               (0.020)       (0.019)    
##   Neighborhood: Gilbert/Blmngtn                                                0.006         0.018     
##                                                                               (0.019)       (0.018)    
##   Neighborhood: IDOTRR/Blmngtn                                                -0.052*       -0.046*    
##                                                                               (0.023)       (0.022)    
##   Neighborhood: MeadowV/Blmngtn                                               -0.059*       -0.054*    
##                                                                               (0.024)       (0.024)    
##   Neighborhood: Mitchel/Blmngtn                                                0.028         0.043*    
##                                                                               (0.020)       (0.019)    
##   Neighborhood: NAmes/Blmngtn                                                  0.036         0.045*    
##                                                                               (0.018)       (0.018)    
##   Neighborhood: NoRidge/Blmngtn                                                0.079***      0.087***  
##                                                                               (0.020)       (0.020)    
##   Neighborhood: NPkVill/Blmngtn                                               -0.012         0.008     
##                                                                               (0.029)       (0.028)    
##   Neighborhood: NridgHt/Blmngtn                                                0.089***      0.082***  
##                                                                               (0.019)       (0.018)    
##   Neighborhood: NWAmes/Blmngtn                                                 0.026         0.040*    
##                                                                               (0.019)       (0.019)    
##   Neighborhood: OldTown/Blmngtn                                               -0.009        -0.008     
##                                                                               (0.021)       (0.020)    
##   Neighborhood: Sawyer/Blmngtn                                                 0.036         0.045*    
##                                                                               (0.020)       (0.019)    
##   Neighborhood: SawyerW/Blmngtn                                                0.013         0.018     
##                                                                               (0.019)       (0.019)    
##   Neighborhood: Somerst/Blmngtn                                                0.028         0.029     
##                                                                               (0.018)       (0.018)    
##   Neighborhood: StoneBr/Blmngtn                                                0.095***      0.091***  
##                                                                               (0.022)       (0.021)    
##   Neighborhood: SWISU/Blmngtn                                                  0.004         0.013     
##                                                                               (0.024)       (0.023)    
##   Neighborhood: Timber/Blmngtn                                                 0.063**       0.069***  
##                                                                               (0.020)       (0.020)    
##   Neighborhood: Veenker/Blmngtn                                                0.113***      0.115***  
##                                                                               (0.027)       (0.026)    
##   KitchenQual                                                                                0.036***  
##                                                                                             (0.004)    
## -------------------------------------------------------------------------------------------------------
##   R-squared                          0.671         0.760         0.802         0.842         0.851     
##   adj. R-squared                     0.671         0.760         0.801         0.839         0.848     
##   sigma                              0.099         0.084         0.077         0.069         0.067     
##   F                               2967.527      2305.971      1954.783       281.030       290.397     
##   p                                  0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood                  1306.993      1537.571      1674.635      1838.826      1881.858     
##   Deviance                          14.158        10.314         8.544         6.819         6.428     
##   AIC                            -2607.987     -3067.142     -3339.271     -3619.651     -3703.716     
##   BIC                            -2592.136     -3046.009     -3312.854     -3466.431     -3545.212     
##   N                               1456          1456          1456          1456          1456         
## =======================================================================================================

Together, the variables OverallQual, GrLivArea, YearBuilt, Neighborhood and KitchenQual account for approximately 85% of the variance in the log-transformed sale price of the houses (model m5: R-squared 0.851, adjusted R-squared 0.848).
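The stepwise comparison of nested models can be sketched with base R alone. The data below are simulated: the column names follow the dataset, but the values and the coefficients used to generate them are made up, so the numbers printed will not match the table above. Only the first two models, m1 and m2, are reproduced.

```r
# Simulated data (illustrative only): names follow the data set,
# values and generating coefficients are made up
set.seed(1)
n <- 300
df <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GrLivArea   = round(runif(n, 600, 3500))
)
df$SalePrice <- 10^(4.0 + 0.05 * df$OverallQual +
                      0.3 * log10(df$GrLivArea) + rnorm(n, sd = 0.05))

# Two nested models on the log10 scale, mirroring m1 and m2
m1 <- lm(log10(SalePrice) ~ OverallQual, data = df)
m2 <- lm(log10(SalePrice) ~ OverallQual + log10(GrLivArea), data = df)

# Adjusted R-squared penalises extra predictors, so it suits this comparison
adj_r2 <- c(m1 = summary(m1)$adj.r.squared,
            m2 = summary(m2)$adj.r.squared)
print(round(adj_r2, 3))

# F-test for whether adding log10(GrLivArea) improves the fit
print(anova(m1, m2))
```

Because the response is log10(SalePrice), the R-squared values describe variance explained on the log scale, not in the raw dollar prices.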

Multivariate Analysis

The size of various features of the property, from the living area to the garage and basement, had a positive relation with the sale price of the property. There seemed to be a positive correlation between the sizes of the living area, garage and basement too, which was reflected in the sale price. The year built was highly correlated with the sale price and overall quality of the property.

It was intriguing to note that some of the old houses sold at good prices. These old houses that had higher sale prices were bigger houses with good quality kitchens and bigger garages.

The linear model created using log of sale price with features like overall quality, log of ground living area, neighbourhood, kitchen quality and year built is able to account for approximately 84.8% of the variance in the log-transformed sale price of the houses (adjusted R-squared).

This can definitely be improved by looking at other features like SaleCondition, SaleType and LotConfig in more depth. I think including the number of bathrooms can also help improve the model, after analysing its interaction with the living area and age of the building. Also, some of the features which were excluded from the project could help improve the model.


Final Plots and Summary

Plot One

Description One

There is a very high correlation between the sale price of a house and its overall quality. This makes sense, since no buyer would want to pay a higher price for a property in bad shape.

Plot Two

Description Two

This plot highlights the strong correlation of two features, overall quality and living area, with the main feature, sale price. It is also interesting to note the strong correlation between overall quality and living area, with bigger houses being in better condition.

Plot Three

Description Three

I think it was natural to find this association between higher prices and certain neighbourhoods. People paid a high premium to be in certain neighbourhoods even if the property was older.


Reflection

The Ames Housing dataset used here contains sales within Ames from 2006 to 2010, as described by the author of the data. The dataset has a large number of variables and, for the purpose of this project, I chose to keep only the variables that I felt a potential house buyer would look at before buying a house, such as the size of the property, number of rooms, bathrooms, overall condition of the property, location etc.

I also chose to retain sale condition and sale type as I felt it might have some effect on the sale price. Preliminary analysis of the data showed that while most of the sales were under normal conditions, sale of new houses which were partially complete fetched higher price. Also, most of the sales were conventional warranty deeds or homes that were just constructed and sold, with new homes sold at higher prices.

Most of the houses in the dataset are average and better in quality at the time of sale. As I had expected, good quality houses have higher sale price and the quality of the house is better for newer houses. I went with my preliminary hunch that people would look for overall size of a house, including the quality of the kitchen and number of bathrooms plus maybe even the size of the garage. And indeed, while most of the houses had average or good kitchen, kitchen in excellent condition did influence the sale price of the house. People paid higher price for bigger garage area as well. Interestingly, bigger houses also were in better condition, thereby selling at higher prices.

My initial thought while looking at the neighbourhood and Condition1 data, which describes proximity to features such as green belts and parks, was that people might prefer neighbourhoods which fulfilled more of these criteria. But most of the observations did not show such a preference, although pricey neighbourhoods did have these features.

Another feature that intrigued me was that some old houses fetched better prices in comparison. I looked at this aspect from different perspectives, starting from their condition at the time of the sale, the neighbourhood they belonged to, the size of the house and even the number of bathrooms. Finally, it turned out that, apart from other things, these houses were indeed in better condition when sold.

A few features that I could have analysed further were number of bathrooms, lot configuration, sale condition and sale type. I could see there was some interaction between the number of bathrooms and the age of houses. Also, sale price was positively correlated with the number of bathrooms only up to a certain number. Abnormal sale condition also seemed to influence the sale price. Further, I noticed that plots with frontage on three sides and those in cul-de-sacs had better sale prices. These aspects can be investigated further and maybe included in the model too.

References